OPR
The ability to reproduce a parallel execution is desirable for debugging and program reliability purposes. In debugging (13), the programmer needs to manually step back in time, while for resilience (6) replay is performed automatically by the application upon failure. To be useful, replay has to faithfully reproduce the original execution. For parallel programs, the main challenge is inferring and maintaining the order of conflicting operations (data races). Deterministic record and replay (R&R) techniques have been developed for multithreaded shared-memory programs (5), as well as for distributed-memory programs (14). Our main interest is in R&R techniques for large-scale scientific (3; 4) programming models.
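To make the replay idea concrete, below is a minimal Python sketch of order-based record and replay, not the mechanism of any paper listed here: the record phase serializes conflicting accesses and logs their total order, and the replay phase forces each access to wait for its recorded turn. All class and method names are illustrative.

```python
import threading

class Recorder:
    """Record phase: serialize conflicting accesses and log their order."""
    def __init__(self):
        self.lock = threading.Lock()
        self.log = []                     # total order of thread ids

    def access(self, tid, op):
        with self.lock:                   # serializes conflicting operations
            self.log.append(tid)
            return op()                   # perform the actual load/store

class Replayer:
    """Replay phase: each access waits for its recorded turn."""
    def __init__(self, log):
        self.log = log
        self.pos = 0
        self.cv = threading.Condition()

    def access(self, tid, op):
        with self.cv:
            while self.log[self.pos] != tid:
                self.cv.wait()            # not this thread's turn yet
            result = op()                 # re-execute in the recorded order
            self.pos += 1
            self.cv.notify_all()
            return result
```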
DwarvesGraph: A High-Performance Graph Mining System with Pattern Decomposition
This paper presents DwarvesGraph, the first graph mining system that
decomposes the target pattern into several subpatterns, and then computes the
count of each. The count of the target pattern can then be computed from the
subpattern counts at very low additional cost. Although decomposition-based
algorithms have been studied for years, we propose several novel techniques to
address key system challenges: 1) a partial-embedding-centric programming model
with efficient support for pattern existence queries and advanced graph mining
applications such as FSM; 2) an accurate and efficient cost model based on
approximate graph mining; 3) an efficient search method to jointly determine
the decomposition of all concrete patterns of an application, considering the
computation cost and cross-pattern computation reuse; and 4) the partial
symmetry breaking technique to eliminate redundant enumeration for each
subpattern while preserving equivalence of computation. Our experiments show
that DwarvesGraph is significantly faster than all existing state-of-the-art
systems and provides a novel and viable path to scaling to large patterns.
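As a toy illustration of decomposition-based pattern counting (the general idea, not DwarvesGraph's actual algorithm), the count of open wedges, i.e., 2-paths whose endpoints are not adjacent, can be derived from two cheaper subpattern counts: total wedges and triangles. Since every triangle closes exactly three wedges, open_wedges = total_wedges - 3 * triangles.

```python
from itertools import combinations
from math import comb

def open_wedge_count(adj):
    """adj: dict mapping vertex -> set of neighbors (undirected graph)."""
    # Subpattern 1: all wedges, from degrees alone.
    total_wedges = sum(comb(len(nbrs), 2) for nbrs in adj.values())
    # Subpattern 2: triangles, each counted once via ordered enumeration.
    triangles = sum(
        1
        for u in adj
        for v, w in combinations(sorted(n for n in adj[u] if n > u), 2)
        if w in adj[v]
    )
    # Target pattern derived from subpattern counts at negligible cost.
    return total_wedges - 3 * triangles

# Triangle 0-1-2 with a pendant vertex 3 attached to 0.
g = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(open_wedge_count(g))  # 2 open wedges: 1-0-3 and 2-0-3
```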
A Case for Reversible Coherence Protocol
We propose the first Reversible Coherence Protocol (RCP), a new protocol
designed from the ground up to enable invisible speculative loads. RCP takes a
bold approach by including speculative loads and merge/purge operations in
the interface between the processor and cache coherence, and by allowing them
to participate in the coherence protocol. This means that speculative loads,
ordinary loads/stores, and merge/purge can all affect the state of a given
cache line. RCP is the first coherence protocol that enables the commit and
squash of speculative loads among distributed cache components in a general memory
hierarchy. RCP incurs an average slowdown of (3.0%, 8.3%, 7.4%) on
(SPEC2006, SPEC2017, PARSEC), lower than the (26.5%, 12%, 18.3%) of
InvisiSpec and the (3.2%, 9.4%, 24.2%) of CleanupSpec. The coherence traffic
overhead is on average 46%, compared to 40% and 27% for InvisiSpec and
CleanupSpec, respectively. Even with this higher traffic overhead (~46%), the
performance overhead of RCP is lower than that of InvisiSpec and comparable to
that of CleanupSpec. This reveals a key advantage of RCP: the coherence actions
triggered by the merge and purge operations are not on the critical path of
execution and can be performed in the cache hierarchy concurrently with
processor execution.
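The following highly simplified Python state machine illustrates the interface idea of letting speculative loads and merge/purge operations participate in coherence state transitions. It is a didactic sketch under that single assumption; RCP's real state machine, with its distributed components and many more states, is not reproduced here.

```python
from enum import Enum, auto

class State(Enum):
    INVALID = auto()
    SHARED = auto()
    SPEC = auto()        # line brought in by a speculative load

class Line:
    """One cache line whose state speculative operations can legally change."""
    def __init__(self):
        self.state = State.INVALID

    def spec_load(self):
        # The speculative load is a first-class coherence event, not hidden.
        if self.state is State.INVALID:
            self.state = State.SPEC

    def merge(self):
        # Commit: promote the speculative copy to an ordinary shared copy.
        if self.state is State.SPEC:
            self.state = State.SHARED

    def purge(self):
        # Squash: roll the speculative copy back, leaving no trace.
        if self.state is State.SPEC:
            self.state = State.INVALID
```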
GraphR: Accelerating Graph Processing Using ReRAM
This paper presents GRAPHR, the first ReRAM-based graph processing
accelerator. GRAPHR follows the principle of near-data processing and explores
the opportunity of performing massive parallel analog operations with low
hardware and energy cost. The analog computation is suitable for graph
processing because: 1) the algorithms are iterative and can inherently
tolerate imprecision; 2) both probability calculations (e.g., PageRank and
Collaborative Filtering) and typical graph algorithms involving integers (e.g.,
BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if the
vertex program of a graph algorithm can be expressed as sparse matrix-vector
multiplication (SpMV), it can be performed efficiently by a ReRAM crossbar. We
show that this assumption is generally true for a large set of graph
algorithms. GRAPHR is a novel accelerator architecture consisting of two
components: memory ReRAM and graph engine (GE). The core graph computations are
performed in sparse matrix format in GEs (ReRAM crossbars). The
vector/matrix-based graph computation is not new, but ReRAM offers the unique
opportunity to realize the massive parallelism with unprecedented energy
efficiency and low hardware cost. With small subgraphs processed by GEs, the
gain from performing parallel operations outweighs the waste due to sparsity.
The experimental results show that GRAPHR achieves a 16.01x speedup (up to
132.67x) and a 33.82x energy saving (geometric mean) compared to a CPU
baseline system. Compared to a GPU, GRAPHR achieves a 1.69x to 2.19x speedup
and consumes 4.77x to 8.91x less energy. GRAPHR gains a speedup of 1.16x to
4.12x, and is 3.67x to 10.96x more energy efficient, compared to a PIM-based
architecture.
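To illustrate the SpMV formulation that GRAPHR maps onto ReRAM crossbars, here is a plain NumPy sketch of PageRank, whose per-iteration kernel is exactly a matrix-vector product (the operation a crossbar performs in the analog domain). The damping factor, iteration count, and dangling-node handling are standard textbook choices, not GRAPHR specifics.

```python
import numpy as np

def pagerank(adj, d=0.85, iters=50):
    """adj[i, j] = 1 if there is an edge i -> j (dense here for brevity)."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    out_deg[out_deg == 0] = 1                  # crude dangling-node guard
    M = (adj / out_deg[:, None]).T             # column-stochastic transition matrix
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (M @ r)          # the SpMV step: one crossbar pass
    return r

A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
print(pagerank(A))
```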
GNNPipe: Scaling Deep GNN Training with Pipelined Model Parallelism
Communication is a key bottleneck for distributed graph neural network (GNN)
training. This paper proposes GNNPipe, a new approach that scales
distributed full-graph deep GNN training. As the first system to use
layer-level model parallelism for GNN training, GNNPipe partitions GNN layers
among GPUs, with each device performing the computation for a disjoint subset
of consecutive GNN
layers on the whole graph. Compared to graph parallelism with each GPU handling
a graph partition, GNNPipe reduces the communication volume by a factor of the
number of GNN layers. GNNPipe overcomes the unique challenges for pipelined
layer-level model parallelism on the whole graph by partitioning it into
dependent chunks, allowing the use of historical vertex embeddings, and
applying specific training techniques to ensure convergence. We also propose a
hybrid approach by combining GNNPipe with graph parallelism to handle large
graphs, achieve better compute resource utilization, and ensure model
convergence. We build a general GNN training system supporting all three
parallelism settings. Extensive experiments show that our method reduces the
per-epoch training time by up to 2.45x (on average 1.58x) and reduces the
communication volume and overhead by up to 22.89x and 27.21x (on average 8.69x
and 11.60x), respectively, while achieving a comparable level of model accuracy
and convergence speed compared to graph parallelism.
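The scheduling logic behind pipelined layer-level model parallelism can be sketched in a few lines, assuming the graph's vertices are split into C chunks and the GNN layers into S consecutive stages (one per GPU): stage s works on chunk c at pipeline step c + s, so all stages stay busy once the pipeline fills. This is purely illustrative; GNNPipe's handling of cross-chunk edges, historical embeddings, and convergence techniques is omitted.

```python
def pipeline_schedule(num_chunks, num_stages):
    """Return, for each pipeline step, the active (stage, chunk) pairs."""
    steps = []
    for t in range(num_chunks + num_stages - 1):
        active = [(s, t - s) for s in range(num_stages)
                  if 0 <= t - s < num_chunks]   # stage s handles chunk t - s
        steps.append(active)
    return steps

# e.g. 4 chunks over 3 stages: the pipeline fills, runs full, then drains.
for t, active in enumerate(pipeline_schedule(4, 3)):
    print(f"step {t}: {active}")
# step 0: [(0, 0)]
# step 1: [(0, 1), (1, 0)]
# ...
# step 5: [(2, 3)]
```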
Low-Cost Floating-Point Processing in ReRAM for Scientific Computing
We propose ReFloat, a principled approach for low-cost floating-point
processing in ReRAM. ReFloat stores exponent offsets relative to a shared base
in a flexible and fine-grained floating-point representation. The key
motivation is that, while the number of exponent bits must be reduced because
of its exponential relation to computation latency and hardware cost,
convergence still requires sufficient exponent accuracy. Our design
reconciles these conflicting goals by storing the exponent offsets from a
common base among the matrix values in a block, which is the granularity of
computation in ReRAM. Due to value locality, the differences among the
exponents in a block are small, so the offsets require far fewer bits to
represent the exponents. In essence, ReFloat enables principled local
fine-tuning of the floating-point representation. Based on this idea, we
define a flexible ReFloat format that specifies the matrix block size and the
numbers of bits for the exponent and fraction. To determine the base for each
block, we propose an optimization method that minimizes the difference between
the exponents of the original matrix block and those of the converted block.
We develop the conversion scheme from the default double-precision
floating-point format to the ReFloat format, the computation procedure, and
the low-cost floating-point processing architecture in ReRAM.
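A minimal sketch of the per-block representation idea in NumPy: values in a block share a base exponent and store only small signed exponent offsets plus a truncated fraction. The base choice here (rounded mean exponent) is one plausible heuristic, not necessarily the paper's optimization method, and the bit widths are arbitrary examples.

```python
import numpy as np

def to_refloat(block, exp_bits=2, frac_bits=3):
    mant, exp = np.frexp(block)                 # block[i] = mant[i] * 2**exp[i]
    base = int(np.round(exp.mean()))            # shared per-block base exponent
    lo, hi = -(2 ** (exp_bits - 1)), 2 ** (exp_bits - 1) - 1
    off = np.clip(exp - base, lo, hi)           # small signed exponent offsets
    frac = np.round(mant * 2 ** frac_bits) / 2 ** frac_bits  # truncated fraction
    return base, off.astype(int), frac

def from_refloat(base, off, frac):
    return np.ldexp(frac, base + off)           # frac * 2**(base + off)

# Value locality keeps the offsets tiny, so few exponent bits suffice.
block = np.array([0.731, 0.52, 1.14, 0.66])
print(from_refloat(*to_refloat(block)))         # lossy reconstruction
```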